@stephenswat
Member

CUDA 13.0 allows the PTX assembler to spill registers to shared memory instead of local memory. This should be much faster, and it should also reduce the local memory usage of our fitting and finding kernels, which currently bottleneck our throughput.

@stephenswat stephenswat added the performance Performance-relevant changes label Aug 20, 2025
@stephenswat
Member Author

I'm not 100% certain this works as intended in its current form, since the pragma is meant to be attached at function scope. But we can try.
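For reference, a minimal sketch of what attaching the pragma at function scope could look like. This is not the code from this PR: `scale_kernel` and `run_scale_demo` are hypothetical names, and the sketch assumes a CUDA 13.0 or newer toolkit in which the PTX pragma is spelled `enable_smem_spilling`. The explicit `__launch_bounds__` is there so the assembler knows the block size when budgeting shared memory; a kernel this small will not actually spill, so it only shows where the pragma goes.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, not taken from this PR.
__global__ void __launch_bounds__(256)
scale_kernel(float* data, int n, float factor) {
    // Assumption: CUDA 13.0+ PTX pragma, placed as the first statement of
    // the kernel body so that it applies at function scope.
    asm volatile(".pragma \"enable_smem_spilling\";");

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= factor;
    }
}

// Launches the kernel on a small buffer; returns 0 on success, 1 on failure.
int run_scale_demo() {
    const int n = 1024;
    float host[n];
    for (int i = 0; i < n; ++i) {
        host[i] = 1.0f;
    }

    float* dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(float)) != cudaSuccess) {
        return 1;
    }
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

    // 256 threads per block, matching the launch bounds above.
    scale_kernel<<<(n + 255) / 256, 256>>>(dev, n, 2.0f);
    cudaError_t err = cudaDeviceSynchronize();

    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    if (err != cudaSuccess) {
        std::printf("kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return (host[0] == 2.0f && host[n - 1] == 2.0f) ? 0 : 1;
}
```

Calling `run_scale_demo()` from a `main` function and compiling with `nvcc -Xptxas -v` prints the per-kernel register, spill, and shared memory figures reported by the assembler, which is one way to see whether the pragma had any effect on a real kernel.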

@beomki-yeo
Contributor

beomki-yeo commented Aug 20, 2025

This is interesting, as we are not actively using shared memory in our finding and fitting kernels.
That said, I hope the compiler is smart enough not to overuse shared memory, as this can reduce the number of concurrent blocks. (It would be great if we could limit the amount of shared memory used for register spilling.) Please let us know if there is any noticeable performance change.
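One way to check the concurrency concern is to query the compiled kernel's resource usage and the resulting number of resident blocks per SM, once with and once without spilling to shared memory. The following is a rough sketch using the CUDA runtime API; `demo_kernel` and `report_occupancy` are hypothetical stand-ins for the real finding and fitting kernels. It also assumes that shared memory reserved for spills is reported as part of the kernel's static shared memory, which should be verified on the target toolkit.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for a finding or fitting kernel.
__global__ void __launch_bounds__(256) demo_kernel(float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = 2.0f * static_cast<float>(i);
    }
}

// Prints the kernel's resource usage and returns the number of blocks that
// can be resident on one SM for the given block size, or -1 on error.
int report_occupancy(int block_size) {
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, demo_kernel) != cudaSuccess) {
        return -1;
    }

    int blocks_per_sm = 0;
    // Last argument: dynamic shared memory per block (none in this sketch).
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocks_per_sm, demo_kernel, block_size, 0) != cudaSuccess) {
        return -1;
    }

    std::printf("registers per thread:    %d\n", attr.numRegs);
    std::printf("local memory per thread: %zu bytes\n", attr.localSizeBytes);
    std::printf("static shared per block: %zu bytes\n", attr.sharedSizeBytes);
    std::printf("resident blocks per SM:  %d\n", blocks_per_sm);
    return blocks_per_sm;
}
```

Comparing the output of `report_occupancy(256)` between the two builds would show whether local memory usage drops and whether the extra shared memory costs any resident blocks.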
